Urban Biodiversity Atlas & Habitat Connectivity
Use case scenario
As a: City of Melbourne urban planner, environmental advocate, or community member interested in enhancing urban biodiversity and ecological connectivity.
I want to: Identify biodiversity cold-spots, map habitat gaps, and recommend targeted planting or habitat interventions to reconnect fragmented green spaces.
So that I can: Support evidence-based greening initiatives, foster community stewardship, and improve the ecological health and liveability of Melbourne’s urban environment.
By:
- Integrating open datasets on trees, canopy cover, invertebrates, open spaces, and barriers.
- Computing insect biodiversity metrics and identifying ecological hot-spots and cold-spots.
- Modelling habitat connectivity and recommending priority actions for planting or habitat structures.
- Publishing actionable insights through interactive dashboards and open APIs.
What this use case will teach you
- How to ingest, clean, and spatially analyse diverse urban ecological datasets using Python and geospatial libraries.
- Techniques for quantifying species richness, diversity, and identifying ecological clusters in an urban context.
- Approaches to model habitat connectivity and prioritise interventions using spatial analysis and predictive modelling.
- Best practices for communicating actionable insights through interactive dashboards and open data APIs.
- The societal and environmental impact of data-driven urban greening strategies.
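Later sections quantify biodiversity per site; the two workhorse metrics are taxon richness (the count of distinct taxa) and the Shannon diversity index, H' = -Σ p_i ln(p_i). As a minimal, self-contained sketch using hypothetical counts (the site and family names here are illustrative only, not drawn from the real datasets):

```python
import numpy as np
import pandas as pd

# Hypothetical survey counts: records per insect family at two sites
counts = pd.DataFrame({
    'site':   ['Royal Park'] * 3 + ['Argyle Square'] * 2,
    'family': ['FORMICIDAE', 'APIDAE', 'SYRPHIDAE', 'FORMICIDAE', 'APIDAE'],
    'n':      [10, 5, 5, 12, 3],
})

def shannon_index(abundances):
    """Shannon diversity H' = -sum(p_i * ln p_i) over relative abundances."""
    p = np.asarray(abundances, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # zero counts contribute nothing
    return float(-(p * np.log(p)).sum())

summary = counts.groupby('site').agg(
    richness=('family', 'nunique'),   # number of distinct taxa per site
    shannon=('n', shannon_index),     # evenness-aware diversity per site
)
print(summary)
```

A higher H' means records are spread more evenly across taxa; richness alone ignores evenness, which is why both metrics are worth reporting side by side.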
Objectives
- Map current biodiversity and diagnose habitat gaps across the City of Melbourne.
- Prioritise micro-locations for targeted planting or habitat structures to enhance species connectivity.
- Develop and publish an interactive Urban Biodiversity Atlas dashboard and open API.
- Empower council and community stakeholders with actionable, evidence-based recommendations for urban greening.
Initialisation¶
Importing necessary libraries
import warnings
warnings.filterwarnings("ignore")
# Enable inline plotting in Jupyter notebooks
%matplotlib inline
# Data handling
import pandas as pd # Data manipulation (e.g. reading CSVs, merging sensor & API data)
import numpy as np # Numerical operations (e.g. statistics, array maths)
import math # Maths functions (e.g. sqrt for computing buffer radii)
import json # Parse JSON responses from Open Meteo or other APIs
from io import StringIO # In-memory text I/O (e.g. loading CSV data from a string)
import re # Regular expressions for text processing (e.g. extracting genus from notes)
# Geospatial processing
import geopandas as gpd # GeoDataFrames for shapefiles & GeoJSON
from shapely.geometry import Point, shape # Create/manipulate geometric objects (sensor points, canopy polygons)
from geopy.distance import geodesic # Calculate distances between lat/lon points (e.g. for a 50 m radius)
# Static & interactive mapping
import contextily as ctx # Basemap tiles for GeoPandas plots (e.g. OSM background)
import folium # Interactive leaflet maps in Jupyter (e.g. pan/zoom sensor coverage)
from folium.plugins import MarkerCluster # Cluster nearby markers on interactive maps
from branca.element import Template, MacroElement # Overlay Legend on Folium Maps
# Visualisation
import matplotlib.pyplot as plt # Static charts (e.g. bar plots, heatmaps)
import seaborn as sns # Statistical viz (e.g. correlation matrix heatmap)
import plotly.express as px # Interactive plots (e.g. species counts by location)
# HTTP requests with caching & retries
import requests # API calls (e.g. fetch tree-canopy GeoJSON)
import requests_cache # Cache API responses (avoid repeated rate limits)
from retry_requests import retry # Retry logic (e.g. for transient network errors)
import openmeteo_requests # Client for Open Meteo weather & air-pollution API
# Notebook display helpers
from IPython.display import IFrame # Embed HTML (e.g. folium maps) directly in cells
# Utility data structures
from collections import defaultdict # Default dictionaries (e.g. grouping counts by schedule)
# Machine learning pipeline
from sklearn.pipeline import Pipeline # Chain preprocessing & model steps
from sklearn.impute import SimpleImputer # Handle missing values (e.g. fill NaNs in numeric features)
from sklearn.preprocessing import OneHotEncoder # Encode categorical features (e.g. month → dummy vars)
from sklearn.ensemble import RandomForestRegressor # Bagging-based ensemble regressor
from xgboost import XGBRegressor # Gradient-boosting regressor
from sklearn.model_selection import RandomizedSearchCV, GroupKFold # Hyperparameter search & grouped CV
from sklearn.metrics import mean_squared_error, r2_score # Evaluation metrics (e.g. RMSE, R²)
import joblib # Save/load trained models (e.g. persist best model)
from geopy.geocoders import Nominatim # Geocoding (e.g. convert addresses to coordinates)
from geopy.extra.rate_limiter import RateLimiter # Geocoding rate limiter (e.g. avoid exceeding API limits)
Importing the data through API from open data portal of Melbourne¶
The function below accesses open datasets via the portal's API endpoints, retrieving them in CSV format for in-depth analysis. Given a dataset identifier (and, where required, a valid API key), it issues a request to the Melbourne data portal and parses the response into a DataFrame. This streamlines the incorporation of diverse datasets such as microclimate sensors, urban tree canopies and tree-planting zones, facilitating straightforward access and efficient data integration for urban planning research.
def import_data(dataset_id):
    """
    Imports a dataset from the City of Melbourne Open Data API.
    Parameters:
    - dataset_id (str): The unique dataset identifier.
    Returns:
    - pd.DataFrame: The imported dataset as a pandas DataFrame, or None if the request fails.
    """
    base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
    apikey = '<your API key>'  # Insert your API key here if the dataset requires one
    export_format = 'csv'
    url = f'{base_url}{dataset_id}/exports/{export_format}'
    params = {
        'select': '*',
        'limit': -1,  # all records
        'lang': 'en',
        'timezone': 'UTC'
    }
    # GET request
    response = requests.get(url, params=params)
    if response.status_code == 200:
        # StringIO lets pandas read the CSV data straight from memory
        url_content = response.content.decode('utf-8')
        df = pd.read_csv(StringIO(url_content), delimiter=';')
        print(f'Imported the {dataset_id} dataset with {len(df)} records successfully\n')
        return df
    else:
        print(f'Request failed with status code {response.status_code}')
        return None
Datasets¶
Melbourne’s liveability depends not only on its built form but also on the ecological health of its parks, streets and private gardens. Insects, birds and bats pollinate plants, recycle nutrients and form the base of urban food-webs, yet their habitat is fragmented by roads and dense development. In line with Chameleon’s mission to “enhance life through the application of smart-city technologies” and the goal of showcasing practical applications of City-of-Melbourne (CoM) open data, this use case will build an Urban Biodiversity Atlas that quantifies species richness, pin-points ecological “cold-spots” and recommends green corridors that reconnect them. Insights will guide both council planting programmes and community-led actions such as pollinator gardens and nesting-box installations.
Below are the primary datasets used:
| Theme | Dataset | Source | Key fields / notes |
|---|---|---|---|
| Flora structure | Trees with Species & Dimensions (Urban Forest) – ~70 000 street & park trees | Melbourne open data (link) | Species, DBH, life-stage, health, location |
| Canopy extent | Tree Canopies 2021 (Urban Forest) | Melbourne open data (link) | High-resolution canopy polygons |
| Invertebrates | Insect Records – “Little Things that Run the City” | Melbourne open data (link) | Insect species, abundance, sampling site |
| Barriers | 2020 Building Footprints | Melbourne open data (link) | Footprint polygons for least-cost analysis |
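The building footprints act as movement barriers in the later connectivity modelling. The real analysis operates on geospatial data, but the core idea of a least-cost route can be sketched on a toy cost grid with a standard-library Dijkstra search (all values below are synthetic, not derived from the datasets):

```python
import heapq

# 4x4 traversal-cost grid: 1 = open space, 10 = building footprint (barrier)
cost = [
    [1, 1, 10, 1],
    [1, 10, 10, 1],
    [1, 1, 1, 1],
    [10, 10, 1, 1],
]

def least_cost_path(cost, start, goal):
    """Dijkstra over a 4-connected grid; edge cost = cost of the cell entered."""
    rows, cols = len(cost), len(cost[0])
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            break
        if d > dist[(r, c)]:
            continue  # stale heap entry
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < dist.get((nr, nc), float('inf')):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    # Reconstruct the route from goal back to start
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

route = least_cost_path(cost, (0, 0), (3, 3))
print(route)  # the route skirts the high-cost "building" cells
```

The same principle, applied to rasterised canopy and footprint layers, is what identifies candidate green corridors between fragmented habitat patches.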
Importing dataset - insect-records-in-the-city-of-melbourne-from-little-things-that-run-the-city¶
About the dataset: This dataset contains detailed insect records from "The Little Things that Run the City" project - a critical resource for understanding urban biodiversity patterns across Melbourne. The collection includes identified insect species found across various parks and gardens, providing baseline data for mapping biodiversity hotspots and analysing habitat connectivity. Field surveys were conducted between October 2014 and March 2015, with species identification completed between April and September 2015. Understanding insect diversity is fundamental to developing targeted habitat interventions and measuring the ecological health of Melbourne's urban environment.
# Importing insect records dataset
insect_records = 'insect-records-in-the-city-of-melbourne-from-little-things-that-run-the-city'
df_insect_records = import_data(insect_records)
df_insect_records.to_csv('df_insect_records.csv', index=False) # saving into a local file
df_insect_records_orig = df_insect_records.copy() # keep a copy of the original dataset
print('First few rows of the dataset:\n')
df_insect_records.head()
Imported the insect-records-in-the-city-of-melbourne-from-little-things-that-run-the-city dataset with 1295 records successfully

First few rows of the dataset:
| taxa | kingdom | phylum | class | order | family | genus | species | identification_notes | location | sighting_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Insect | ANIMALIA | ARTHROPODA | INSECTA | HYMENOPTERA | PTEROMALIDAE | NaN | NaN | Pteromalidae 4 | Fitzroy-Treasury Gardens | NaN |
| 1 | Insect | ANIMALIA | ARTHROPODA | INSECTA | DIPTERA | PYRGOTIDAE | NaN | NaN | Pyrgotidae 1 | Royal Park | NaN |
| 2 | Insect | ANIMALIA | ARTHROPODA | INSECTA | DIPTERA | SCENOPINIDAE | NaN | NaN | Scenopinidae 2 | Royal Park | NaN |
| 3 | Insect | ANIMALIA | ARTHROPODA | INSECTA | DIPTERA | SEPSIDAE | NaN | NaN | Sepsidae 1 | Princes Park | NaN |
| 4 | Insect | ANIMALIA | ARTHROPODA | INSECTA | DIPTERA | STRATIOMYIDAE | NaN | NaN | Stratiomyidae 2 | Lincoln Square | NaN |
Importing dataset - tree-canopies-2021-urban-forest¶
About the dataset: The Tree Canopies 2021 - Urban Forest dataset maps the extent of tree canopy cover across the City of Melbourne using aerial imagery and LiDAR data. It provides detailed spatial insights into urban forest coverage, supporting initiatives in climate resilience, biodiversity, and urban planning.
# Importing tree canopy dataset
tree_canopy_2021 = 'tree-canopies-2021-urban-forest'
df_tree_canopy_2021 = import_data(tree_canopy_2021)
df_tree_canopy_2021.to_csv('df_tree_canopy_2021.csv', index=False) # saving into a local file
df_tree_canopy_2021_orig = df_tree_canopy_2021.copy() # keep a copy of the original dataset
print('First few rows of the dataset:\n')
df_tree_canopy_2021.head(5)
Imported the tree-canopies-2021-urban-forest dataset with 57980 records successfully

First few rows of the dataset:
| geo_point_2d | geo_shape | |
|---|---|---|
| 0 | -37.8298681237421, 144.98303001088595 | {"coordinates": [[[[144.9832974445821, -37.829... |
| 1 | -37.829874533279096, 144.97144661356745 | {"coordinates": [[[[144.9714529379414, -37.829... |
| 2 | -37.83021396760069, 144.98646678135142 | {"coordinates": [[[[144.98647926050035, -37.83... |
| 3 | -37.828742240015515, 144.9011718210025 | {"coordinates": [[[[144.90116683929529, -37.82... |
| 4 | -37.829920930428415, 144.96518349051888 | {"coordinates": [[[[144.96517363556384, -37.82... |
Importing trees-with-species-and-dimensions-urban-forest¶
About the dataset: This dataset details the location, species and lifespan of Melbourne's urban forest by precinct. The City of Melbourne maintains more than 70,000 trees.
# Importing urban forest dataset
urban_forest = 'trees-with-species-and-dimensions-urban-forest'
df_urban_forest = import_data(urban_forest)
df_urban_forest.to_csv('df_urban_forest.csv', index=False) # saving into a local file
df_urban_forest_orig = df_urban_forest.copy() # keep a copy of the original dataset
print('First few rows of the dataset:\n')
df_urban_forest.head(5)
Imported the trees-with-species-and-dimensions-urban-forest dataset with 76928 records successfully

First few rows of the dataset:
| com_id | common_name | scientific_name | genus | family | diameter_breast_height | year_planted | date_planted | age_description | useful_life_expectency | useful_life_expectency_value | precinct | located_in | uploaddate | coordinatelocation | latitude | longitude | easting | northing | geolocation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1049657 | Unknown | Melaleuca parvistaminea | Melaleuca | Myrtaceae | NaN | 1998 | 1998-12-17 | NaN | NaN | NaN | NaN | Park | 2021-01-10 | -37.79070542406818, 144.94466634984954 | -37.790705 | 144.944666 | 319025.79 | 5815416.68 | -37.79070542406818, 144.94466634984954 |
| 1 | 1782373 | Coastal Banksia | Banksia integrifolia | Banksia | Proteaceae | NaN | 2020 | 2020-03-04 | NaN | NaN | NaN | NaN | Park | 2021-01-10 | -37.802899143753464, 144.92619307686192 | -37.802899 | 144.926193 | 317429.05 | 5814027.65 | -37.802899143753464, 144.92619307686192 |
| 2 | 1604511 | Red Box | Eucalyptus polyanthemos | Eucalyptus | Myrtaceae | NaN | 2015 | 2015-05-08 | NaN | NaN | NaN | NaN | Park | 2021-01-10 | -37.79572286091489, 144.9693861369436 | -37.795723 | 144.969386 | 321214.72 | 5814907.50 | -37.79572286091489, 144.9693861369436 |
| 3 | 1070399 | Ironbark | Eucalyptus sideroxylon | Eucalyptus | Myrtaceae | 12.0 | 2006 | 2006-12-19 | Semi-Mature | 31-60 years | 60.0 | NaN | Street | 2021-01-10 | -37.82793397289453, 144.90197974533947 | -37.827934 | 144.901980 | 315359.55 | 5811202.01 | -37.82793397289453, 144.90197974533947 |
| 4 | 1734680 | Drooping sheoak | Allocasuarina verticillata | Allocasuarina | Casuarinaceae | NaN | 2018 | 2018-09-05 | NaN | NaN | NaN | NaN | Park | 2021-01-10 | -37.792723710257945, 144.94819168988934 | -37.792724 | 144.948192 | 319341.15 | 5815199.54 | -37.792723710257945, 144.94819168988934 |
Importing 2020-building-footprints¶
About the dataset: This dataset shows the footprints of all structures within the City of Melbourne. A building footprint is a 2D polygon (or multi-polygon) representation of the base of a building or structure. The footprint is defined as the boundary of the structure where the walls intersect with the ground plane or podium, rather than an outline of the roof area (roofprint).
building_footprint = '2020-building-footprints'
df_building_footprint = import_data(building_footprint)
df_building_footprint.to_csv('df_building_footprint.csv', index=False) # saving into a local file
df_building_footprint_orig = df_building_footprint.copy() # keep a copy of the original dataset
print('First few rows of the dataset:\n')
df_building_footprint.head(5)
Imported the 2020-building-footprints dataset with 37750 records successfully

First few rows of the dataset:
| geo_point_2d | geo_shape | footprint_type | tier | structure_max_elevation | footprint_max_elevation | structure_min_elevation | property_id | structure_id | footprint_extrusion | footprint_min_elevation | structure_extrusion | roof_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -37.80037045080206, 144.9464370995547 | {"coordinates": [[[[144.94650487159868, -37.80... | Structure | 1 | 24.5 | 24.5 | 15.0 | 107105.0 | 818620 | 9.5 | 15.0 | 9.5 | Flat |
| 1 | -37.80034679637949, 144.94754591875628 | {"coordinates": [[[[144.94761503800615, -37.80... | Structure | 1 | 23.5 | 23.5 | 13.0 | 107102.0 | 805966 | 10.5 | 13.0 | 10.5 | Flat |
| 2 | -37.80027329346989, 144.94824324805376 | {"coordinates": [[[[144.94834088635758, -37.80... | Structure | 1 | 24.0 | 24.0 | 13.0 | 107100.0 | 813265 | 10.5 | 13.0 | 11.0 | Flat |
| 3 | -37.800556970621834, 144.94811576128058 | {"coordinates": [[[[144.94818490989783, -37.80... | Structure | 1 | 25.5 | 25.5 | 16.0 | 107100.0 | 813267 | 9.0 | 16.0 | 9.5 | Flat |
| 4 | -37.80167146256072, 144.94443776463496 | {"coordinates": [[[[144.9444642519418, -37.801... | Structure | 2 | 21.0 | 21.0 | 14.5 | 105780.0 | 804769 | 7.0 | 14.5 | 6.5 | Flat |
Data Cleansing and Preprocessing¶
The Data Cleansing and preprocessing phase focuses on preparing the tree canopies, insect records, urban forests and building footprint datasets for analysis. This involves resolving inconsistencies, handling missing entries, and reformatting data as needed—such as separating latitude and longitude fields, removing redundant columns, and ensuring appropriate structure across datasets. These steps are critical to harmonise the datasets for seamless integration and analysis. By standardising and validating the data, this process enhances the accuracy and reliability of any insights derived.
def split_geo_coordinates(df, geo_column):
    """
    Splits a combined latitude,longitude column into two separate float columns: 'latitude' and 'longitude'.
    Parameters:
    - df (pd.DataFrame): The input DataFrame containing the geo column.
    - geo_column (str): The name of the column with 'latitude,longitude' string values.
    Returns:
    - pd.DataFrame: A new DataFrame with separate 'latitude' and 'longitude' columns.
    """
    if geo_column not in df.columns:
        raise ValueError(f"Column '{geo_column}' not found in DataFrame.")
    try:
        # Ensure the geo_column is of string type
        df[geo_column] = df[geo_column].astype(str)
        # Attempt to split the column
        split_data = df[geo_column].str.split(',', expand=True)
        if split_data.shape[1] != 2:
            raise ValueError(f"Column '{geo_column}' does not contain valid 'latitude,longitude' format.")
        df['latitude'] = pd.to_numeric(split_data[0], errors='coerce')
        df['longitude'] = pd.to_numeric(split_data[1], errors='coerce')
        # Drop rows with invalid coordinates
        df.dropna(subset=['latitude', 'longitude'], inplace=True)
        # Drop the original geo column
        df = df.drop(columns=[geo_column])
        print('Dataset Info after Geo Split:\n')
        print(df.info())
    except Exception as e:
        print(f"An error occurred during geolocation splitting: {e}")
        raise
    return df
def check_preprocess_dataset(df_dataset, dataset_name='dataset'):
    """
    Inspects and preprocesses a dataset:
    - Prints dataset info
    - Checks for missing values
    - Removes duplicate rows (if any)
    Parameters:
    - df_dataset (pd.DataFrame): The input DataFrame to be checked and cleaned.
    - dataset_name (str): Optional name of the dataset for logging purposes.
    Returns:
    - pd.DataFrame: A cleaned version of the input DataFrame.
    """
    try:
        if not isinstance(df_dataset, pd.DataFrame):
            raise TypeError("Input is not a pandas DataFrame.")
        print(f'Dataset Information for "{dataset_name}":\n')
        print(df_dataset.info())
        # Check for missing values
        print(f'\nMissing values in "{dataset_name}" dataset:\n')
        print(df_dataset.isnull().sum())
        # Identify and remove duplicates
        dupes = df_dataset.duplicated().sum()
        if dupes > 0:
            df_dataset = df_dataset.drop_duplicates()
            print(f'\nDeleted {dupes} duplicate record(s) from "{dataset_name}".')
        else:
            print(f'\nNo duplicate records found in "{dataset_name}".')
    except Exception as e:
        print(f"An error occurred while preprocessing '{dataset_name}': {e}")
        raise
    return df_dataset
Insect Records Dataset¶
Checking for missing values & duplicate records
df_insect_records = check_preprocess_dataset(df_insect_records, 'Insect Records Dataset')
Dataset Information for "Insect Records Dataset":

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1295 entries, 0 to 1294
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   taxa                  1295 non-null   object
 1   kingdom               1295 non-null   object
 2   phylum                1295 non-null   object
 3   class                 1295 non-null   object
 4   order                 1295 non-null   object
 5   family                1290 non-null   object
 6   genus                 589 non-null    object
 7   species               264 non-null    object
 8   identification_notes  1031 non-null   object
 9   location              1295 non-null   object
 10  sighting_date         0 non-null      float64
dtypes: float64(1), object(10)
memory usage: 111.4+ KB
None

Missing values in "Insect Records Dataset" dataset:

taxa                       0
kingdom                    0
phylum                     0
class                      0
order                      0
family                     5
genus                    706
species                 1031
identification_notes     264
location                   0
sighting_date           1295
dtype: int64

No duplicate records found in "Insect Records Dataset".
Filling Missing Taxonomic Information in Insect Records¶
In the insect dataset, we have some records with missing genus or species information. To improve the completeness of the data for better biodiversity analysis, we'll extract the missing information from the identification notes where possible.
Approach:
For missing genus names: We look in the identification notes field for words that look like genus names (words that start with a capital letter) and use these to fill in the blanks.
For missing species names: We look for patterns like "sp.1" or "sp 3" in the notes, which are common ways scientists record unidentified species within a genus. We standardise these to a consistent format (e.g., "sp1").
This process helps us create a more complete taxonomic classification, which is key for accurately measuring biodiversity across Melbourne's urban landscape.
def extract_genus_from_notes(notes: str) -> str | None:
    """Extracts the genus name from identification_notes if present.
    The genus is assumed to be the first word beginning with a capital letter.
    Returns None if no genus-like pattern is found.
    """
    if not isinstance(notes, str):
        return None
    # Match the first word starting with an uppercase letter followed by letters
    match = re.match(r'\b([A-Z][a-zA-Z]+)\b', notes)
    return match.group(1) if match else None


def extract_species_code(notes: str) -> str | None:
    """Extracts the species number code from identification_notes.
    Returns a string of the form 'sp<number>' or None if no number is found.
    """
    if not isinstance(notes, str):
        return None
    # Look for patterns like 'sp.1', 'sp 1', 'sp. 3', etc.
    match = re.search(r'\bsp\.?\s*(\d+)\b', notes, flags=re.IGNORECASE)
    if match:
        number = match.group(1)
        return f"sp{number}"
    return None
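A quick sanity check of the two extraction patterns, reproduced inline so the cell runs on its own (the sample note strings below are hypothetical):

```python
import re

def genus_from(notes):
    # Same pattern as extract_genus_from_notes: first capitalised word
    m = re.match(r'\b([A-Z][a-zA-Z]+)\b', notes)
    return m.group(1) if m else None

def species_code_from(notes):
    # Same pattern as extract_species_code: 'sp.1' / 'sp 1' → 'sp1'
    m = re.search(r'\bsp\.?\s*(\d+)\b', notes, flags=re.IGNORECASE)
    return f"sp{m.group(1)}" if m else None

print(genus_from('Melophorus sp. 3'))        # → 'Melophorus'
print(species_code_from('Melophorus sp. 3')) # → 'sp3'
print(species_code_from('Pteromalidae 4'))   # → None (no 'sp' marker)
```

Note that for notes like "Pteromalidae 4" the genus pattern returns the family name, which is why the filled-in values should be treated as best-effort taxonomy rather than confirmed identifications.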
# Apply genus extraction
missing_genus_mask = df_insect_records['genus'].isna() | (df_insect_records['genus'].str.strip() == "")
df_insect_records.loc[missing_genus_mask, 'genus'] = df_insect_records.loc[missing_genus_mask, 'identification_notes'].apply(extract_genus_from_notes)
# Apply species code extraction for rows with null species or empty string
missing_species_mask = df_insect_records['species'].isna() | (df_insect_records['species'].str.strip() == "")
df_insect_records.loc[missing_species_mask, 'species'] = df_insect_records.loc[missing_species_mask, 'identification_notes'].apply(extract_species_code)
# display first 5 rows of the updated dataset
df_insect_records[['genus', 'species', 'identification_notes']].head()
| genus | species | identification_notes | |
|---|---|---|---|
| 0 | Pteromalidae | None | Pteromalidae 4 |
| 1 | Pyrgotidae | None | Pyrgotidae 1 |
| 2 | Scenopinidae | None | Scenopinidae 2 |
| 3 | Sepsidae | None | Sepsidae 1 |
| 4 | Stratiomyidae | None | Stratiomyidae 2 |
# get unique location names
locations = df_insect_records['location'].dropna().unique()
print(f"Unique locations in the dataset: {len(locations)}")
print(locations)
Unique locations in the dataset: 15 ['Fitzroy-Treasury Gardens' 'Royal Park' 'Princes Park' 'Lincoln Square' 'Pleasance Gardens' "Women's Peace Gardens" 'Carlton Gardens South' 'Westgate Park' 'Canning/Neil Street Reserve' 'Murchinson Square' 'Argyle Square' 'State Library of Victoria' 'University Square' 'Gardiner Reserve' 'Garrard Street Reserve']
Correcting some location names so that geocoding returns accurate coordinates.
# Define your mapping:
mapping = {
'Fitzroy-Treasury Gardens': 'Treasury Gardens',
"Women's Peace Gardens": 'Peace Gardens',
'Canning/Neil Street Reserve': 'Canning Street Reserve',
'Murchinson Square': 'Murchison Square',
'Garrard Street Reserve': 'Gerard Street Reserve',
}
# Apply it in-place to the location column:
df_insect_records['location'] = df_insect_records['location'].replace(mapping)
# (Optional) Verify:
print(df_insect_records['location'].unique())
# get unique location names
locations = df_insect_records['location'].dropna().unique()
['Treasury Gardens' 'Royal Park' 'Princes Park' 'Lincoln Square' 'Pleasance Gardens' 'Peace Gardens' 'Carlton Gardens South' 'Westgate Park' 'Canning Street Reserve' 'Murchison Square' 'Argyle Square' 'State Library of Victoria' 'University Square' 'Gardiner Reserve' 'Gerard Street Reserve']
# geocode each location
geolocator = Nominatim(user_agent="little_things_project")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # rate-limit calls
coords = {}
for loc in locations:
    query = f"{loc}, Melbourne, Victoria, Australia"
    result = geocode(query)
    if result:
        coords[loc] = {'latitude': result.latitude, 'longitude': result.longitude}
    else:
        coords[loc] = {'latitude': None, 'longitude': None}
# map lat/lon back onto the dataset
df_insect_records['latitude'] = df_insect_records['location'].map(lambda x: coords.get(x, {}).get('latitude'))
df_insect_records['longitude'] = df_insect_records['location'].map(lambda x: coords.get(x, {}).get('longitude'))
df_insect_records.head(5) # display first 5 rows of the updated dataset with geocoded coordinates
| taxa | kingdom | phylum | class | order | family | genus | species | identification_notes | location | sighting_date | latitude | longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Insect | ANIMALIA | ARTHROPODA | INSECTA | HYMENOPTERA | PTEROMALIDAE | Pteromalidae | None | Pteromalidae 4 | Treasury Gardens | NaN | -37.814316 | 144.975998 |
| 1 | Insect | ANIMALIA | ARTHROPODA | INSECTA | DIPTERA | PYRGOTIDAE | Pyrgotidae | None | Pyrgotidae 1 | Royal Park | NaN | -37.781268 | 144.951682 |
| 2 | Insect | ANIMALIA | ARTHROPODA | INSECTA | DIPTERA | SCENOPINIDAE | Scenopinidae | None | Scenopinidae 2 | Royal Park | NaN | -37.781268 | 144.951682 |
| 3 | Insect | ANIMALIA | ARTHROPODA | INSECTA | DIPTERA | SEPSIDAE | Sepsidae | None | Sepsidae 1 | Princes Park | NaN | -37.783751 | 144.961831 |
| 4 | Insect | ANIMALIA | ARTHROPODA | INSECTA | DIPTERA | STRATIOMYIDAE | Stratiomyidae | None | Stratiomyidae 2 | Lincoln Square | NaN | -37.802439 | 144.962880 |
Selecting the relevant columns¶
Selecting the relevant columns and dropping the rest of the columns.
df_insect_records = df_insect_records[['taxa', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus',
'location','latitude', 'longitude']]
# print the columns of the filtered dataset
print(df_insect_records.columns)
Index(['taxa', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus',
'location', 'latitude', 'longitude'],
dtype='object')
Tree Canopies 2021 dataset¶
Checking for missing values & duplicate records
df_tree_canopy_2021 = check_preprocess_dataset(df_tree_canopy_2021, 'Tree Canopies 2021')
Dataset Information for "Tree Canopies 2021":

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57980 entries, 0 to 57979
Data columns (total 2 columns):
 #  Column        Non-Null Count  Dtype
--  ------        --------------  -----
 0  geo_point_2d  57980 non-null  object
 1  geo_shape     57980 non-null  object
dtypes: object(2)
memory usage: 906.1+ KB
None

Missing values in "Tree Canopies 2021" dataset:

geo_point_2d    0
geo_shape       0
dtype: int64

No duplicate records found in "Tree Canopies 2021".
To facilitate spatial analysis, the geo_point_2d column was split into separate latitude and longitude columns. These new columns were then converted into numeric formats to allow for further computations and visualisations. Finally, the original geo_point_2d column was dropped to avoid redundancy, leaving a clean and structured dataset ready for spatial analysis and modeling.
#splitting geo coordinates
df_tree_canopy_2021 = split_geo_coordinates(df_tree_canopy_2021,'geo_point_2d')
print('First few rows of the dataset after preprocessing:\n')
df_tree_canopy_2021.head(5)
Dataset Info after Geo Split:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57980 entries, 0 to 57979
Data columns (total 3 columns):
 #  Column     Non-Null Count  Dtype
--  ------     --------------  -----
 0  geo_shape  57980 non-null  object
 1  latitude   57980 non-null  float64
 2  longitude  57980 non-null  float64
dtypes: float64(2), object(1)
memory usage: 1.3+ MB
None

First few rows of the dataset after preprocessing:
| geo_shape | latitude | longitude | |
|---|---|---|---|
| 0 | {"coordinates": [[[[144.9832974445821, -37.829... | -37.829868 | 144.983030 |
| 1 | {"coordinates": [[[[144.9714529379414, -37.829... | -37.829875 | 144.971447 |
| 2 | {"coordinates": [[[[144.98647926050035, -37.83... | -37.830214 | 144.986467 |
| 3 | {"coordinates": [[[[144.90116683929529, -37.82... | -37.828742 | 144.901172 |
| 4 | {"coordinates": [[[[144.96517363556384, -37.82... | -37.829921 | 144.965183 |
Selecting the relevant columns¶
Selecting the relevant columns and dropping the rest of the columns.
df_tree_canopy_2021 = df_tree_canopy_2021[['geo_shape', 'latitude', 'longitude']]
# Print the columns in the updated dataset
print(df_tree_canopy_2021.columns)
Index(['geo_shape', 'latitude', 'longitude'], dtype='object')
Urban Forest Dataset¶
Checking for missing values & duplicate records
df_urban_forest = check_preprocess_dataset(df_urban_forest, 'Urban Forest Dataset')
Dataset Information for "Urban Forest Dataset":

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76928 entries, 0 to 76927
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   com_id                        76928 non-null  int64
 1   common_name                   76903 non-null  object
 2   scientific_name               76927 non-null  object
 3   genus                         76927 non-null  object
 4   family                        76927 non-null  object
 5   diameter_breast_height        24986 non-null  float64
 6   year_planted                  76928 non-null  int64
 7   date_planted                  76928 non-null  object
 8   age_description               24969 non-null  object
 9   useful_life_expectency        24969 non-null  object
 10  useful_life_expectency_value  24969 non-null  float64
 11  precinct                      0 non-null      float64
 12  located_in                    76926 non-null  object
 13  uploaddate                    76928 non-null  object
 14  coordinatelocation            76928 non-null  object
 15  latitude                      76928 non-null  float64
 16  longitude                     76928 non-null  float64
 17  easting                       76928 non-null  float64
 18  northing                      76928 non-null  float64
 19  geolocation                   76928 non-null  object
dtypes: float64(7), int64(2), object(11)
memory usage: 11.7+ MB
None

Missing values in "Urban Forest Dataset" dataset:

com_id                              0
common_name                        25
scientific_name                     1
genus                               1
family                              1
diameter_breast_height          51942
year_planted                        0
date_planted                        0
age_description                 51959
useful_life_expectency          51959
useful_life_expectency_value    51959
precinct                        76928
located_in                          2
uploaddate                          0
coordinatelocation                  0
latitude                            0
longitude                           0
easting                             0
northing                            0
geolocation                         0
dtype: int64

No duplicate records found in "Urban Forest Dataset".
Deleting records with missing taxonomic information about the tree species, as this information cannot be reliably reconstructed.
# Identify records with missing taxonomic information
missing_taxonomy = df_urban_forest[
df_urban_forest['genus'].isna() |
df_urban_forest['family'].isna() |
df_urban_forest['scientific_name'].isna()
]
# Report how many records are affected
print(f"Found {len(missing_taxonomy)} record(s) with missing taxonomic information.")
# Delete the records with missing taxonomic information
df_urban_forest = df_urban_forest.dropna(subset=['genus', 'family', 'scientific_name'])
# Confirm the deletion
print(f"Deleted {len(missing_taxonomy)} record(s) with missing taxonomic information.")
Found 1 record(s) with missing taxonomic information. Deleted 1 record(s) with missing taxonomic information.
Selecting the relevant columns¶
Selecting the relevant columns and dropping the rest of the columns.
df_urban_forest = df_urban_forest[['com_id', 'common_name', 'scientific_name', 'genus', 'family',
'year_planted', 'date_planted',
'latitude', 'longitude', 'easting', 'northing',
'geolocation']]
print(df_urban_forest.columns)
Index(['com_id', 'common_name', 'scientific_name', 'genus', 'family',
'year_planted', 'date_planted', 'latitude', 'longitude', 'easting',
'northing', 'geolocation'],
dtype='object')
Building Footprint Dataset¶
Checking for missing values & duplicate records
df_building_footprint = check_preprocess_dataset(df_building_footprint, 'Building Footprint Dataset')
Dataset Information for "Building Footprint Dataset": <class 'pandas.core.frame.DataFrame'> RangeIndex: 37750 entries, 0 to 37749 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 geo_point_2d 37750 non-null object 1 geo_shape 37750 non-null object 2 footprint_type 37750 non-null object 3 tier 37750 non-null int64 4 structure_max_elevation 37750 non-null float64 5 footprint_max_elevation 37750 non-null float64 6 structure_min_elevation 37750 non-null float64 7 property_id 37736 non-null float64 8 structure_id 37750 non-null int64 9 footprint_extrusion 37750 non-null float64 10 footprint_min_elevation 37750 non-null float64 11 structure_extrusion 37750 non-null float64 12 roof_type 37750 non-null object dtypes: float64(7), int64(2), object(4) memory usage: 3.7+ MB None Missing values in "Building Footprint Dataset" dataset: geo_point_2d 0 geo_shape 0 footprint_type 0 tier 0 structure_max_elevation 0 footprint_max_elevation 0 structure_min_elevation 0 property_id 14 structure_id 0 footprint_extrusion 0 footprint_min_elevation 0 structure_extrusion 0 roof_type 0 dtype: int64 No duplicate records found in "Building Footprint Dataset".
Analysis of the missing values showed that property_id is null only where footprint_type is 'Bridge', so these records can safely be retained as is.
# Count unique values in property_id and structure_id columns
property_count = df_building_footprint['property_id'].nunique()
structure_count = df_building_footprint['structure_id'].nunique()
# Count unique combinations of property_id and structure_id
combined_count = df_building_footprint.groupby(['property_id', 'structure_id']).ngroups
print(f"Number of unique property IDs: {property_count}")
print(f"Number of unique structure IDs: {structure_count}")
print(f"Number of unique property ID and structure ID combinations: {combined_count}")
# Find rows where property_id is null
null_property_mask = df_building_footprint['property_id'].isna()
# Extract structure_ids where property_id is null
structure_ids_with_null_property = df_building_footprint.loc[null_property_mask, 'structure_id'].tolist()
# Print the count and values
print(f"Found {len(structure_ids_with_null_property)} structures with null property IDs:")
print(structure_ids_with_null_property)
# Filter once for all structure IDs in the list
filtered_df = df_building_footprint[df_building_footprint['structure_id'].isin(structure_ids_with_null_property)]
selected_columns = filtered_df[['structure_id', 'property_id', 'footprint_type']]
# Show results grouped by structure_id
print(f"\nData for all {len(structure_ids_with_null_property)} structures with null property IDs:")
print(selected_columns.sort_values(by='structure_id'))
Number of unique property IDs: 14102
Number of unique structure IDs: 19018
Number of unique property ID and structure ID combinations: 19153
Found 14 structures with null property IDs:
[802056, 802066, 802067, 802059, 802063, 802060, 802057, 802062, 802069, 802061, 802064, 802065, 802068, 802058]
Data for all 14 structures with null property IDs:
structure_id property_id footprint_type
1224 802056 NaN Bridge
14949 802057 NaN Bridge
32411 802058 NaN Bridge
8975 802059 NaN Bridge
12496 802060 NaN Bridge
22107 802061 NaN Bridge
14950 802062 NaN Bridge
8976 802063 NaN Bridge
22108 802064 NaN Bridge
22109 802065 NaN Bridge
1225 802066 NaN Bridge
1226 802067 NaN Bridge
30783 802068 NaN Bridge
14951 802069 NaN Bridge
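Rather than inspecting the printed table by eye, the "bridges only" conclusion can also be asserted programmatically. A minimal sketch on a toy frame (the values are illustrative):

```python
import pandas as pd

# Toy stand-in for df_building_footprint (illustrative values)
df = pd.DataFrame({
    'structure_id': [802056, 818620, 802057],
    'property_id': [None, 107105.0, None],
    'footprint_type': ['Bridge', 'Structure', 'Bridge'],
})

# True only if every row with a null property_id is a Bridge footprint
null_types = df.loc[df['property_id'].isna(), 'footprint_type']
print((null_types == 'Bridge').all())  # → True
```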
To facilitate spatial analysis, the geo_point_2d column was split into separate latitude and longitude columns and converted to numeric types for computation and visualisation. The original geo_point_2d column was then dropped to avoid redundancy, leaving a clean, structured dataset ready for spatial analysis and modelling.
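`split_geo_coordinates` is a helper defined earlier in the notebook. A minimal sketch of what such a helper might look like, assuming `geo_point_2d` holds comma-separated "lat, lon" strings (the exact format in the source data may differ):

```python
import pandas as pd

def split_geo_coordinates(df, col):
    """Split a 'lat, lon' string column into numeric latitude/longitude columns."""
    coords = df[col].str.split(',', expand=True)
    df['latitude'] = pd.to_numeric(coords[0], errors='coerce')
    df['longitude'] = pd.to_numeric(coords[1], errors='coerce')
    # Drop the original column to avoid redundancy
    return df.drop(columns=[col])

demo = pd.DataFrame({'geo_point_2d': ['-37.8004, 144.9464', '-37.8003, 144.9475']})
demo = split_geo_coordinates(demo, 'geo_point_2d')
print(demo[['latitude', 'longitude']].dtypes.unique())  # both columns are float64
```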
df_building_footprint = split_geo_coordinates(df_building_footprint, 'geo_point_2d')
print('First few rows of the dataset after preprocessing:\n')
df_building_footprint.head(5)
Dataset Info after Geo Split: <class 'pandas.core.frame.DataFrame'> RangeIndex: 37750 entries, 0 to 37749 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 geo_shape 37750 non-null object 1 footprint_type 37750 non-null object 2 tier 37750 non-null int64 3 structure_max_elevation 37750 non-null float64 4 footprint_max_elevation 37750 non-null float64 5 structure_min_elevation 37750 non-null float64 6 property_id 37736 non-null float64 7 structure_id 37750 non-null int64 8 footprint_extrusion 37750 non-null float64 9 footprint_min_elevation 37750 non-null float64 10 structure_extrusion 37750 non-null float64 11 roof_type 37750 non-null object 12 latitude 37750 non-null float64 13 longitude 37750 non-null float64 dtypes: float64(9), int64(2), object(3) memory usage: 4.0+ MB None First few rows of the dataset after preprocessing:
| geo_shape | footprint_type | tier | structure_max_elevation | footprint_max_elevation | structure_min_elevation | property_id | structure_id | footprint_extrusion | footprint_min_elevation | structure_extrusion | roof_type | latitude | longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {"coordinates": [[[[144.94650487159868, -37.80... | Structure | 1 | 24.5 | 24.5 | 15.0 | 107105.0 | 818620 | 9.5 | 15.0 | 9.5 | Flat | -37.800370 | 144.946437 |
| 1 | {"coordinates": [[[[144.94761503800615, -37.80... | Structure | 1 | 23.5 | 23.5 | 13.0 | 107102.0 | 805966 | 10.5 | 13.0 | 10.5 | Flat | -37.800347 | 144.947546 |
| 2 | {"coordinates": [[[[144.94834088635758, -37.80... | Structure | 1 | 24.0 | 24.0 | 13.0 | 107100.0 | 813265 | 10.5 | 13.0 | 11.0 | Flat | -37.800273 | 144.948243 |
| 3 | {"coordinates": [[[[144.94818490989783, -37.80... | Structure | 1 | 25.5 | 25.5 | 16.0 | 107100.0 | 813267 | 9.0 | 16.0 | 9.5 | Flat | -37.800557 | 144.948116 |
| 4 | {"coordinates": [[[[144.9444642519418, -37.801... | Structure | 2 | 21.0 | 21.0 | 14.5 | 105780.0 | 804769 | 7.0 | 14.5 | 6.5 | Flat | -37.801671 | 144.944438 |
Selecting the relevant columns¶
Retaining only the columns relevant to the analysis and dropping the rest.
df_building_footprint = df_building_footprint[['geo_shape', 'footprint_type', 'tier', 'structure_max_elevation',
'structure_min_elevation', 'property_id', 'structure_id', 'latitude',
'longitude']]
print(df_building_footprint.columns)
Index(['geo_shape', 'footprint_type', 'tier', 'structure_max_elevation',
'structure_min_elevation', 'property_id', 'structure_id', 'latitude',
'longitude'],
dtype='object')
Data Analysis and Visualisation¶
In this section, we explore Melbourne's urban biodiversity using interactive maps and charts. These visualisations help us understand the distribution of trees, insects, and buildings across the city, making it easier for everyone to see where nature thrives and where improvements can be made.
Each visualisation is explained in simple terms, so you can easily interpret what the data shows and how it relates to the health and connectivity of our urban environment.
Tree Canopy Map¶
This map shows the spread of tree canopies across Melbourne. Each green area represents the coverage of tree leaves and branches, which provide shade, cool the city, and support wildlife.
How to read this map:
- Larger green areas mean more tree cover, which is good for the environment and people.
- Smaller or missing green areas highlight places that may need more trees or greening.
By looking at this map, you can easily spot which parts of the city are well-covered by trees and which areas could benefit from more planting.
# Convert Tree Canopies dataset into a GeoDataFrame if not already done
if 'geometry' not in df_tree_canopy_2021.columns:
df_tree_canopy_2021['geometry'] = df_tree_canopy_2021['geo_shape'].apply(lambda x: shape(json.loads(x)))
# Create GeoDataFrame
gdf = gpd.GeoDataFrame(df_tree_canopy_2021, geometry='geometry', crs='EPSG:4326')
# Project to Web Mercator for compatibility with contextily basemaps
gdf_projected = gdf.to_crs(epsg=3857)
# Create the plot
fig, ax = plt.subplots(figsize=(14, 12))
# Plot tree canopy data
gdf_projected.plot(ax=ax, color='green', edgecolor='darkgreen', alpha=0.7,
label='Tree Canopies')
# Add the contextily basemap
ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron)
# Add title and labels
ax.set_title('Tree Canopy Coverage (2021) - Melbourne', fontsize=14)
ax.set_xlabel("Easting (m, Web Mercator)", fontsize=12)
ax.set_ylabel("Northing (m, Web Mercator)", fontsize=12)
# Improve readability
plt.tight_layout()
# Show the plot
plt.show()
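The map shows where canopy exists, but not how much. To quantify coverage, the polygons can be projected to a metric CRS (for Melbourne, GDA94 / MGA zone 55, EPSG:28355, keeps areas in square metres) and their areas summed via `gdf.geometry.area`. Under the hood, the area of each simple polygon ring comes from the shoelace formula; a minimal pure-Python sketch:

```python
def polygon_area(vertices):
    """Shoelace formula: area of a simple polygon from [(x, y), ...] vertices."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the ring
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# A 2 m x 3 m rectangle has area 6 m^2
print(polygon_area([(0, 0), (2, 0), (2, 3), (0, 3)]))  # → 6.0
```

Summing these per-polygon areas and dividing by the municipal area would give a canopy-cover percentage, a single figure that is easier to track year on year than a map.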
Insect Diversity Charts¶
These charts show the variety of insect species found in different parts of Melbourne. Each bar or section represents a group of insects, helping us see which areas have the most diversity.
How to read these charts:
- Taller bars or larger sections mean more types of insects are present, which is a sign of a healthy ecosystem.
- Shorter bars or smaller sections show fewer species, which may mean the area needs more habitat support.
By understanding insect diversity, we can identify places that are rich in life and those that could benefit from more conservation efforts.
# Count unique insect species per location
species_by_location = df_insect_records.groupby('location')['genus'].nunique().sort_values(ascending=False)
plt.figure(figsize=(12,6))
sns.barplot(x=species_by_location.index, y=species_by_location.values,
            hue=species_by_location.index, palette='viridis', legend=False)
plt.xticks(rotation=90)
plt.xlabel('Location')
plt.ylabel('Number of Unique Insect Genera')
plt.title('Insect Diversity Across Melbourne Locations')
plt.tight_layout()
plt.show()
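Counting unique genera measures richness, but two sites with the same richness can differ in evenness. A common complement is the Shannon diversity index, H' = -Σ pᵢ ln pᵢ, where pᵢ is the proportion of records belonging to genus i. A minimal sketch on toy records (the genus names and locations are illustrative, not from the survey data):

```python
import numpy as np
import pandas as pd

def shannon_index(counts):
    """Shannon diversity H' = -sum(p * ln p) over genus proportions."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()  # drop zero counts, normalise to proportions
    return float(-(p * np.log(p)).sum())

# Two sites with equal richness (2 genera) but different evenness
records = pd.DataFrame({
    'location': ['A'] * 10 + ['B'] * 10,
    'genus': ['Apis'] * 5 + ['Vespula'] * 5 + ['Apis'] * 9 + ['Vespula'],
})
h = records.groupby('location')['genus'].apply(lambda g: shannon_index(g.value_counts()))
print(h.round(3))  # site A (even split) scores higher than B (one dominant genus)
```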
Insect Habitat Density Map¶
This map displays the distribution of insect habitats across Melbourne’s green spaces. Each blue circle marks a site where insect surveys took place.
How to read this map:
- Larger circles indicate parks or reserves with a higher number of insect observations, pointing to richer biodiversity at those locations.
- Smaller circles denote sites where fewer insects were recorded, suggesting areas that might benefit from targeted ecological improvements.
By examining this map, you can quickly identify which parts of the city are most supportive of insect life and which areas could be strengthened to enhance urban biodiversity.
# compute density (number of insect records) at each site
density_by_location = df_insect_records.groupby('location').size().rename('density')
# aggregate to one row per location, including coordinates
site_summary = (
df_insect_records
.groupby('location')
.agg({
'latitude': 'mean',
'longitude': 'mean'
})
.reset_index()
)
# merge density values into the summary
site_summary = site_summary.merge(
density_by_location,
left_on='location',
right_index=True
)
# build the Folium map centred on Melbourne
m = folium.Map(location=[-37.81, 144.96], zoom_start=13)
# add circle markers with radius scaled by density
for _, row in site_summary.iterrows():
folium.CircleMarker(
location=[row['latitude'], row['longitude']],
radius=4 + row['density'] * 0.1, # adjust multiplier to suit
tooltip=row['location'],
fill=True,
fill_opacity=0.6
).add_to(m)
# display map
m
Building Footprint Map¶
This map displays the locations and types of buildings throughout Melbourne. Different symbols and colours are used to show the type of building footprint, making it easy to see the variety of structures in the city.
How to read this map:
- Each point or shape represents a building or structure.
- The colour or symbol tells you what type of footprint it is, such as residential, commercial, or other.
- Areas with lots of buildings may act as barriers to wildlife movement, while open spaces can help connect habitats.
By viewing this map, you can understand how buildings are spread out and how they might affect the movement of animals and plants in the city.
# Define colour mapping for footprint types
footprint_colours = {
'Structure': 'grey',
'Bridge': 'red',
'Tram Stop': 'orange',
'Jetties': 'green',
'Ramp': 'purple',
'Toilet': 'pink',
'Train Platform': 'brown'
}
# Create a base map centred on Melbourne
m_buildings = folium.Map(location=melbourne_coords, zoom_start=16)
for idx, row in df_building_footprint.iterrows():
# Assign color based on footprint_type, default to 'blue' if not found
colour = footprint_colours.get(row['footprint_type'], 'blue')
folium.CircleMarker(
location=[row['latitude'], row['longitude']],
radius=0.5,
color=colour,
fill=True,
fill_opacity=0.2,
popup=f"Type: {row['footprint_type']}"
).add_to(m_buildings)
# Add legend using branca.element
legend_html = """
{% macro html(this, kwargs) %}
<div style="
position: fixed;
bottom: 50px; left: 50px; width: 180px; height: 180px;
z-index:9999; font-size:14px;
background: white; border:2px solid grey; border-radius:8px; padding: 10px;">
<b>Legend</b><br>
<i style="color:grey;">●</i> Structure<br>
<i style="color:red;">●</i> Bridge<br>
<i style="color:orange;">●</i> Tram Stop<br>
<i style="color:green;">●</i> Jetties<br>
<i style="color:purple;">●</i> Ramp<br>
<i style="color:pink;">●</i> Toilet<br>
<i style="color:brown;">●</i> Train Platform<br>
<i style="color:blue;">●</i> Other
</div>
{% endmacro %}
"""
macro = MacroElement()
macro._template = Template(legend_html)
m_buildings.get_root().add_child(macro)
m_buildings
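With roughly 38,000 rows, adding one CircleMarker per footprint can make the notebook sluggish to render. One workaround (a sketch, not part of the original analysis) is to plot a per-type sample, so rare types such as bridges stay visible while the overall marker count drops:

```python
import pandas as pd

# Toy stand-in for df_building_footprint (illustrative coordinates)
df = pd.DataFrame({
    'footprint_type': ['Structure'] * 90 + ['Bridge'] * 10,
    'latitude': [-37.80] * 100,
    'longitude': [144.95] * 100,
})

# Keep at most n rows per footprint type; small types are kept in full
n = 20
sampled = (
    df.groupby('footprint_type', group_keys=False)
      .apply(lambda g: g.sample(min(len(g), n), random_state=42))
)
print(sampled['footprint_type'].value_counts().to_dict())  # → {'Structure': 20, 'Bridge': 10}
```

The same loop over `sampled.iterrows()` would then build the map with far fewer markers; the `n = 20` cap is an arbitrary choice and would need tuning against the real data.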